Parallel and Cross-Language Computing

A Hands-On Workshop for Empirical Researchers

Nelson Areal

NIPE/University of Minho

Miguel Portela

NIPE/University of Minho/Banco de Portugal

December 14, 2025

Plan for the session

Discuss some strategies to optimize code performance:

  • Optimize the algorithm
  • Use parallel computing
  • Use a lower-level language (C++, Julia, etc.)
  • Use the GPU

We are going to illustrate these strategies using two examples.

Some initial thoughts on optimization

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Knuth, D. E. (1974). Structured programming with go to statements. ACM Computing Surveys (CSUR), 6(4), 261-301.

Ready made tools

For most data manipulation tasks, there are already great alternatives:

These tools are optimized for performance and can handle large datasets efficiently.

There is no need to reinvent the wheel.

How to identify bottlenecks?

Use profiling tools to identify bottlenecks in your code, in R consider using the profvis package.

And then focus on optimizing the parts of the code that are most time-consuming.

The strategies we are going to discuss can be applied if the bottleneck is not yet programmed in an optimized way or in parallel.

Two illustrations

We are going to illustrate some of the strategies using two examples:

  • American option valuation
  • High-dimensional fixed effect models

Valuing an american option

Option valuation with binomial trees

We need the following parameters to value an option using a binomial tree:

  • Current underlying asset price: \(S_0\)
  • Strike price: \(K\)
  • Time to maturity: \(T\) in years
  • Risk-free interest rate: \(r\) continuously compounded
  • Dividend yield: \(q\) continuously compounded
  • Volatility of the underlying asset: \(\sigma\)
  • Number of time steps in the binomial tree: \(N\)
  • Type of option: Call or Put

We are going to assume an American put option for this example.

Option valuation with binomial trees

Once we have the parameters, we can proceed to build the binomial tree and value the option.

We will be using the CRR (Cox-Ross-Rubinstein) parameters:

  • \(u = e^{\sigma\sqrt{\Delta t}}\)
  • \(d = e^{-\sigma\sqrt{\Delta t}}\)
  • \(\Delta t = \frac{T}{N}\)
  • \(p = \frac{e^{(r - q) \Delta t} - d}{u - d}\)

Option valuation with binomial trees

Now we need to follow these steps:

  1. Build the underlying asset price tree
  2. Calculate the option values at each node, starting from maturity and working backwards to the present.
  3. Determine the option price at the initial node.

Option valuation with binomial trees

Build the underlying asset price tree

Option valuation with binomial trees

Calculating option values at maturity

Option valuation with binomial trees

Calculating option values at all previous nodes

Option valuation with binomial trees

Calculating option values at all previous nodes

Binomial trees – Derivatives

R implementation

R implementation

# This function implements a bionomial tree algorithm for
# valuing American put options using Cox-Ross-Rubinstein parameters.

ap_naive <- function(S0, K, r, q, tt, sigma, steps) {
  # S0: Initial stock price
  # K: Strike price
  # r: Risk-free interest rate compounded continuously
  # q: Dividend yield compounded continuously
  # tt: Time to maturity in years
  # sigma: Volatility of the underlying stock
  # steps: Number of time steps in the binomial tree

  dt <- tt / steps # time between steps
  u <- exp(sigma * sqrt(dt)) # Up factor
  d <- exp(-sigma * sqrt(dt)) # Down factor
  p <- (exp((r - q) * dt) - d) / (u - d) # Risk-neutral probability

  # Initialize matrices for asset prices and option values
  # Rows represent nodes (i), columns represent time steps (j)
  asset_prices <- matrix(NA, nrow = steps + 1, ncol = steps + 1)
  option_values <- matrix(NA, nrow = steps + 1, ncol = steps + 1)

  # Fill in asset prices for all nodes in the tree (lower triangular)
  for (j in 1:(steps + 1)) {
    for (i in (steps + 2 - j):(steps + 1)) {
      asset_prices[i, j] <- S0 * u^(steps + 1 - i) * d^(i + j - steps - 2)
    }
  }

  # Initialize option values at maturity (last column)
  for (i in 1:(steps + 1)) {
    option_values[i, steps + 1] <- max(K - asset_prices[i, steps + 1], 0)
  }

  # Backward induction to calculate option price at earlier nodes
  # i - node, j - time step
  for (j in steps:1) {
    for (i in (steps + 2 - j):(steps + 1)) {
      # Calculate continuation value
      option_values[i, j] <-
        (p * option_values[i - 1, j + 1] + (1 - p) * option_values[i, j + 1]) *
        exp(-r * dt)

      # Check for early exercise
      intrinsic_value <- K - asset_prices[i, j]
      option_values[i, j] <- max(option_values[i, j], intrinsic_value)
    }
  }

  return(option_values[steps + 1, 1]) # Option price at the root node
}

Vectors instead of matrices

# This function implements an optimized binomial tree algorithm for
# valuing American put options using Cox-Ross-Rubinstein parameters.
# Optimization:
# Uses vectors instead of matrices and calculates asset prices on the fly.

ap_v1 <- function(S0, K, r, q, tt, sigma, steps) {
  # S0: Initial stock price
  # K: Strike price
  # r: Risk-free interest rate compounded continuously
  # q: Dividend yield compounded continuously
  # tt: Time to maturity in years
  # sigma: Volatility of the underlying stock
  # steps: Number of time steps in the binomial tree

  dt <- tt / steps # time between steps
  u <- exp(sigma * sqrt(dt)) # Up factor
  d <- exp(-sigma * sqrt(dt)) # Down factor
  p <- (exp((r - q) * dt) - d) / (u - d) # Risk-neutral probability
  disc <- exp(-r * dt) # Discount factor

  # Initialize option values at maturity
  # Vector size is steps + 1 (for nodes 0 to steps)
  option_values <- numeric(steps + 1)

  # Calculate option values at maturity
  for (i in 1:(steps + 1)) {
    # At maturity, node i has (steps + 1 - i) up moves and (i - 1) down moves
    asset_price <- S0 * u^(steps + 1 - i) * d^(i - 1)
    option_values[i] <- max(K - asset_price, 0)
  }

  # Backward induction to calculate option price at earlier nodes
  for (j in steps:1) {
    # At time step j, we have j nodes
    for (i in 1:j) {
      # Calculate continuation value
      option_values[i] <- (p *
        option_values[i] +
        (1 - p) * option_values[i + 1]) *
        disc

      # Calculate asset price for early exercise check
      # At time step j-1, node i has (j - i) up moves and (i - 1) down moves
      asset_price <- S0 * u^(j - i) * d^(i - 1)

      # Check for early exercise
      intrinsic_value <- K - asset_price
      option_values[i] <- max(option_values[i], intrinsic_value)
    }
  }

  return(option_values[1]) # Option price at the root node
}

Vectors instead of matrices without the first loop

# This function implements an optimized binomial tree algorithm for
# valuing American put options using Cox-Ross-Rubinstein parameters.
# Optimization:
# Uses vectors instead of matrices and calculates asset prices on the fly
# and removes one loop in relation to ap_v1.

ap_v2 <- function(S0, K, r, q, tt, sigma, steps) {
  # S0: Initial stock price
  # K: Strike price
  # r: Risk-free interest rate compounded continuously
  # q: Dividend yield compounded continuously
  # tt: Time to maturity in years
  # sigma: Volatility of the underlying stock
  # steps: Number of time steps in the binomial tree

  dt <- tt / steps # time between steps
  u <- exp(sigma * sqrt(dt)) # Up factor
  d <- exp(-sigma * sqrt(dt)) # Down factor
  p <- (exp((r - q) * dt) - d) / (u - d) # Risk-neutral probability
  disc <- exp(-r * dt) # Discount factor

  # Vector to store option values
  option_values <- numeric(steps + 1)

  # Backward induction from maturity to root
  for (j in (steps + 1):1) {
    num_nodes <- if (j == steps + 1) steps + 1 else j

    for (i in 1:num_nodes) {
      # Calculate asset price at this node
      asset_price <- S0 * u^(j - i) * d^(i - 1)

      if (j == steps + 1) {
        # At maturity: option value is intrinsic value
        option_values[i] <- max(K - asset_price, 0)
      } else {
        # Before maturity: calculate continuation value and check early exercise
        continuation_value <- (p *
          option_values[i] +
          (1 - p) * option_values[i + 1]) *
          disc
        intrinsic_value <- K - asset_price
        option_values[i] <- max(continuation_value, intrinsic_value)
      }
    }
  }

  return(option_values[1]) # Option price at the root node
}

Similar to v2 but with defensive programming

# This function implements an optimized binomial tree algorithm for
# valuing American put options using Cox-Ross-Rubinstein parameters.
# In relation to ap_v2 includes defensive programming with input validation.

ap_v3 <- function(S0, K, r, q, tt, sigma, steps) {
  # S0: Initial stock price
  # K: Strike price
  # r: Risk-free interest rate compounded continuously
  # q: Dividend yield compounded continuously
  # tt: Time to maturity in years
  # sigma: Volatility of the underlying stock
  # steps: Number of time steps in the binomial tree

  # Input validation
  # Check for missing arguments
  if (
    missing(S0) ||
      missing(K) ||
      missing(r) ||
      missing(q) ||
      missing(tt) ||
      missing(sigma) ||
      missing(steps)
  ) {
    stop("All arguments must be provided")
  }

  # Check for NA or NULL values
  if (
    any(is.na(c(S0, K, r, q, tt, sigma, steps))) ||
      any(is.null(c(S0, K, r, q, tt, sigma, steps)))
  ) {
    stop("Arguments cannot be NA or NULL")
  }

  # Check that all inputs are numeric
  if (
    !is.numeric(S0) ||
      !is.numeric(K) ||
      !is.numeric(r) ||
      !is.numeric(q) ||
      !is.numeric(tt) ||
      !is.numeric(sigma) ||
      !is.numeric(steps)
  ) {
    stop("All arguments must be numeric")
  }

  # Check that inputs are scalar (length 1)
  if (
    length(S0) != 1 ||
      length(K) != 1 ||
      length(r) != 1 ||
      length(q) != 1 ||
      length(tt) != 1 ||
      length(sigma) != 1 ||
      length(steps) != 1
  ) {
    stop("All arguments must be scalar values")
  }

  # Validate parameter ranges
  if (S0 <= 0) {
    stop("Initial stock price (S0) must be positive")
  }

  if (K <= 0) {
    stop("Strike price (K) must be positive")
  }

  if (tt <= 0) {
    stop("Time to maturity (tt) must be positive")
  }

  if (sigma <= 0) {
    stop("Volatility (sigma) must be positive")
  }

  if (q < 0) {
    stop("Dividend yield (q) must be non-negative")
  }

  if (steps < 1) {
    stop("Number of steps must be at least 1")
  }

  # Check that steps is an integer
  if (steps != floor(steps)) {
    stop("Number of steps must be an integer")
  }

  # Calculate parameters
  dt <- tt / steps # time between steps
  u <- exp(sigma * sqrt(dt)) # Up factor
  d <- exp(-sigma * sqrt(dt)) # Down factor
  p <- (exp((r - q) * dt) - d) / (u - d) # Risk-neutral probability

  # Validate risk-neutral probability (arbitrage-free condition)
  if (p < 0 || p > 1) {
    stop(sprintf(
      "Risk-neutral probability p=%.4f is outside [0,1]. Model parameters lead to arbitrage.",
      p
    ))
  }

  disc <- exp(-r * dt) # Discount factor

  # Vector to store option values
  option_values <- numeric(steps + 1)

  # Backward induction from maturity to root
  for (j in (steps + 1):1) {
    num_nodes <- if (j == steps + 1) steps + 1 else j

    for (i in 1:num_nodes) {
      # Calculate asset price at this node
      asset_price <- S0 * u^(j - i) * d^(i - 1)

      if (j == steps + 1) {
        # At maturity: option value is intrinsic value
        option_values[i] <- max(K - asset_price, 0)
      } else {
        # Before maturity: calculate continuation value and check early exercise
        continuation_value <- (p *
          option_values[i] +
          (1 - p) * option_values[i + 1]) *
          disc
        intrinsic_value <- K - asset_price
        option_values[i] <- max(continuation_value, intrinsic_value)
      }
    }
  }

  return(option_values[1]) # Option price at the root node
}

A naive implementation in R

There are other ways that the convergence can be improved:

  • Alternative parameters (e.g., Jarrow-Rudd, Tian, Leisen-Reimer)
  • Control variates
  • Extrapolation techniques (e.g., Richardson extrapolation)

We just want to focus on computational performance improvements.

Benchmarking the R implementations

With only one option:

s <- 40
k <- 40
v <- 0.30
r <- 0.08
tt <- 0.25
d <- 0
nstep <- 1000

bench_result_one_option_r <- bench::mark(
  ap_naive(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v1(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v2(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v3(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  min_iterations = 10L
)
bench_result_one_option_r
# A tibble: 4 × 6
  expression                             min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
1 ap_naive(S0 = s, K = k, r = r, q = … 212ms  224ms      4.13    23.2MB     14.1
2 ap_v1(S0 = s, K = k, r = r, q = d, … 152ms  154ms      6.18   276.2KB     20.4
3 ap_v2(S0 = s, K = k, r = r, q = d, … 152ms  155ms      6.45   263.2KB     20.6
4 ap_v3(S0 = s, K = k, r = r, q = d, … 151ms  152ms      6.55   760.4KB     21.6

Benchmarking the R implementations

With the sample set of 2500 options.

option_set volatility time_to_maturity asset_price dividend_rate riskless_rate exercise_price
1 0.5574030 0.7583443 101.70338 0.035659105 0.09927446 100
2 0.5685377 1.0953070 108.78327 0.020800169 0.00000000 100
3 0.2430698 0.5483695 120.04294 0.010062093 0.09908713 100
4 0.5152238 0.8569535 90.74575 0.013811304 0.08435938 100
5 0.4208728 2.5816131 107.30398 0.002372436 0.09318627 100
6 0.3595480 0.2790196 109.79172 0.038926066 0.04160950 100
7 0.4682942 0.6898661 96.67050 0.010791985 0.08241817 100
8 0.1673333 0.9885690 114.41327 0.088989330 0.07847599 100
9 0.4284961 0.7498327 92.32567 0.029429095 0.00000000 100
10 0.4525324 4.8019876 72.48192 0.099098847 0.05604909 100
11 0.3288709 0.5439403 97.16279 0.041322451 0.09064361 100
12 0.4595561 2.9412671 121.75869 0.041567527 0.09984472 100

Benchmarking the R implementations

With the sample set of 2500 options - Naive

bench::system_time({
  option_values_naive <- option_parameters |>
    mutate(
      option_value = pmap_dbl(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_naive,
        .progress = TRUE
      )
    )
})
process    real 
  10.2m   10.1m 

Benchmarking the R implementations

With the sample set of 2500 options - R v3

bench::system_time({
  option_values_v3 <- option_parameters |>
    mutate(
      option_value = pmap_dbl(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3,
        .progress = TRUE
      )
    )
})
process    real 
  6.72m   6.62m 

Parallel computing in R

The easiest way to reduce computational time is to use parallel computing.

This can be done in several ways, but one of the most straightforward ways to parallelize code in R is to use the furrr package.

The furrr package is a parallel implementation of the purrr package, which uses the future package to provide a simple and consistent API for parallel programming in R.

This allows us to parallelize code on the current machine or on a cluster of machines.

Parallel computing in R

Before doing so, it is always a good idea to create safe versions of the functions to capture potential errors:

ap_naive_safe <- safely(ap_naive)
ap_v1_safe <- safely(ap_v1)
ap_v2_safe <- safely(ap_v2)
ap_v3_safe <- safely(ap_v3)

Parallel computing in R

plan(multisession, workers = 10)
bench::system_time({
  option_values_v3_par <- option_parameters |>
    mutate(
      option_value = future_pmap(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3_safe,
        .progress = TRUE
      )
    ) |>
    unnest_wider(option_value, names_sep = "_")
})
plan(sequential)
process    real 
  1.69s   1.18m 
process    real 
  1.26s  49.71s 
process    real 
  1.25s  48.99s 
process    real 
  1.31s  48.97s 

Check all implementations return the same results

The results are identical across all implementations:

identical(
  option_values_naive_par,
  option_values_v1_par
)

identical(
  option_values_v1_par,
  option_values_v2_par
)

identical(
  option_values_v2_par,
  option_values_v3_par
)

Improving performance with Rcpp

Another approach to improve performance is to use a lower-level language like C++.

Again, there are several ways to do this, but one of the most straightforward ways is to use the Rcpp package.

This package allows us to write C++ code, link and compile and call it as if it were an R function. And it makes it easy to use matrix algebra libraries like Armadillo and Eigen.

For more details see: Rcpp

Improving performance with Rcpp

Using AI to convert R to C++

This example has all the necessary components to use AI tools to convert R code to C++ code:

  • an algorithm already implemented in R;
  • a large dataset that can serve as a benchmark;
  • the new code can be inspected to ensure it is correct (the syntax will not be too different from R).

So we can use an AI coding agent (claude-code, gemini-cli, codex, Aider, Cline, etc.) to help us convert the code to C++.

Improving performance with Rcpp

Caution when using AI coding agents

  • Make sure that you have a comprehensive test suite
  • Make sure that the agent does not change the test suite
  • Use agents in YOLO mode (e.g.: claude --dangerously-skip-permissions) at your own peril! Best to use a sandboxed microVM.

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

Using claude-code to convert R to C++

The prompt used:

Please convert the function in @ap_v3.R to Rcpp, the focus is to maintain accuracy and improve performance.

was purposefully kept simple to see how well the model performs with a minimal prompt.

And yet…

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

Using claude-code to convert R to C++

Improving performance with Rcpp

# This function wraps the C++ implementation (ap_v3_cpp) with R-based input
# validation. It provides the same interface as ap_v3 with improved performance.
# The validation code is identical to ap_v3 to ensure the same error checking.

ap_v3_rcpp <- function(S0, K, r, q, tt, sigma, steps) {
  # S0: Initial stock price
  # K: Strike price
  # r: Risk-free interest rate compounded continuously
  # q: Dividend yield compounded continuously
  # tt: Time to maturity in years
  # sigma: Volatility of the underlying stock
  # steps: Number of time steps in the binomial tree

  # Input validation
  # Check for missing arguments
  if (
    missing(S0) ||
      missing(K) ||
      missing(r) ||
      missing(q) ||
      missing(tt) ||
      missing(sigma) ||
      missing(steps)
  ) {
    stop("All arguments must be provided")
  }

  # Check for NA or NULL values
  if (
    any(is.na(c(S0, K, r, q, tt, sigma, steps))) ||
      any(is.null(c(S0, K, r, q, tt, sigma, steps)))
  ) {
    stop("Arguments cannot be NA or NULL")
  }

  # Check that all inputs are numeric
  if (
    !is.numeric(S0) ||
      !is.numeric(K) ||
      !is.numeric(r) ||
      !is.numeric(q) ||
      !is.numeric(tt) ||
      !is.numeric(sigma) ||
      !is.numeric(steps)
  ) {
    stop("All arguments must be numeric")
  }

  # Check that inputs are scalar (length 1)
  if (
    length(S0) != 1 ||
      length(K) != 1 ||
      length(r) != 1 ||
      length(q) != 1 ||
      length(tt) != 1 ||
      length(sigma) != 1 ||
      length(steps) != 1
  ) {
    stop("All arguments must be scalar values")
  }

  # Validate parameter ranges
  if (S0 <= 0) {
    stop("Initial stock price (S0) must be positive")
  }

  if (K <= 0) {
    stop("Strike price (K) must be positive")
  }

  if (tt <= 0) {
    stop("Time to maturity (tt) must be positive")
  }

  if (sigma <= 0) {
    stop("Volatility (sigma) must be positive")
  }

  if (q < 0) {
    stop("Dividend yield (q) must be non-negative")
  }

  if (steps < 1) {
    stop("Number of steps must be at least 1")
  }

  # Check that steps is an integer
  if (steps != floor(steps)) {
    stop("Number of steps must be an integer")
  }

  # Calculate risk-neutral probability for validation
  dt <- tt / steps
  u <- exp(sigma * sqrt(dt))
  d <- exp(-sigma * sqrt(dt))
  p <- (exp((r - q) * dt) - d) / (u - d)

  # Validate risk-neutral probability (arbitrage-free condition)
  if (p < 0 || p > 1) {
    stop(sprintf(
      "Risk-neutral probability p=%.4f is outside [0,1]. Model parameters lead to arbitrage.",
      p
    ))
  }

  # Call C++ implementation
  return(ap_v3_cpp(S0, K, r, q, tt, sigma, steps))
}
// This is the C++ core implementation of the American put option pricing
// algorithm using the Cox-Ross-Rubinstein binomial tree model.
// It is designed to be called from R via the ap_v3_rcpp() wrapper function.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double ap_v3_cpp(double S0, double K, double r, double q,
                 double tt, double sigma, int steps) {
  // S0: Initial stock price
  // K: Strike price
  // r: Risk-free interest rate compounded continuously
  // q: Dividend yield compounded continuously
  // tt: Time to maturity in years
  // sigma: Volatility of the underlying stock
  // steps: Number of time steps in the binomial tree

  // Calculate binomial tree parameters
  double dt = tt / steps;                          // Time between steps
  double u = exp(sigma * sqrt(dt));                // Up factor
  double d = exp(-sigma * sqrt(dt));               // Down factor
  double p = (exp((r - q) * dt) - d) / (u - d);   // Risk-neutral probability
  double disc = exp(-r * dt);                      // Discount factor

  // Vector to store option values at each node
  NumericVector option_values(steps + 1);

  // Backward induction from maturity to root
  // Note: C++ uses 0-based indexing, so j goes from steps to 0
  for (int j = steps; j >= 0; j--) {
    // Number of nodes at this time step
    int num_nodes = (j == steps) ? (steps + 1) : (j + 1);

    for (int i = 0; i < num_nodes; i++) {
      // Calculate asset price at this node
      // Formula: S0 * u^(j-i) * d^i
      double asset_price = S0 * pow(u, j - i) * pow(d, i);

      if (j == steps) {
        // At maturity: option value is intrinsic value
        option_values[i] = std::max(K - asset_price, 0.0);
      } else {
        // Before maturity: calculate continuation value and check early exercise
        double continuation_value = (p * option_values[i] +
                                    (1 - p) * option_values[i + 1]) * disc;
        double intrinsic_value = K - asset_price;
        option_values[i] = std::max(continuation_value, intrinsic_value);
      }
    }
  }

  // Return the option price at the root node
  return option_values[0];
}
library(Rcpp)
Rcpp::sourceCpp(here("R", "ap_v3_rcpp.cpp"))
bench_result_one_option_rcpp <- bench::mark(
  ap_v3(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v3_rcpp(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  min_iterations = 10L
)
# A tibble: 2 × 6
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 ap_v3(S0 = s, K = k, r = r, q … 156.69ms 159.58ms      6.25    7.87KB     23.1
2 ap_v3_rcpp(S0 = s, K = k, r = …   7.95ms   7.98ms    125.    496.13KB      0  

Improving performance with Rcpp

Benchmark with all options

rcpp_v3_time_all_options <- bench::system_time({
  option_values_rcpp <- option_parameters |>
    mutate(
      option_value = pmap_dbl(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3_rcpp,
        .progress = TRUE
      )
    )
})
rcpp_v3_time_all_options
process    real 
  20.5s   20.2s 

The results are the same at 15 significant digits.

Improving performance with Rcpp

Parallelization

Since the Rcpp function creates an external pointer to another object in C++, we can only use it in R session where it was created.

So we need to modify the R wrapper function to create the external pointer inside the function. This is what the ap_v3_rcpp_par is doing.

Improving performance with Rcpp

Parallelization results - all options

plan(multisession, workers = 10)
rcpp_v3_par_time_all_options <- bench::system_time({
  option_values_rcpp_par <- option_parameters |>
    mutate(
      option_value = future_pmap(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3_rcpp_par_safe,
        .progress = TRUE,
        .options = furrr_options(seed = TRUE)
      )
    ) |>
    unnest_wider(option_value, names_sep = "_")
})
rcpp_v3_par_time_all_options
 process     real 
388.37ms    7.01s 

The best approach would be to create a simple package with the compiled C++ code and then use it in parallel without the need to create the external pointer inside the function.

rcpp_v3_par_time_all_options_pkg
process    real 
175.8ms    2.9s 

Much better!

Improving performance with Julia

An alternative to C++ is to use Julia, a high-level, high-performance programming language for technical computing.

We will be using JuliaCall that allows for a seamless integration between R and Julia.

Improving performance with Julia

Using Claude-code to convert R to Julia

Improving performance with Julia

# R wrapper for Julia implementation of American put option pricing
# Uses JuliaCall package to interface with Julia

# Load required package
if (!require("JuliaCall", quietly = TRUE)) {
  stop(
    "JuliaCall package is required. Install it with: install.packages('JuliaCall')"
  )
}

# Initialize Julia (only needs to be done once per session)
# This function will set up the Julia connection
init_julia_ap <- function(num_threads = "auto") {
  # Set the number of Julia threads before initialization
  # Options: "auto" for all cores, or a specific number like "8"
  Sys.setenv(JULIA_NUM_THREADS = as.character(num_threads))

  # Initialize Julia
  julia_setup()

  # Use here package to reliably find the project root
  if (!require("here", quietly = TRUE)) {
    stop("here package is required. Install it with: install.packages('here')")
  }

  # Source the Julia file from the R directory
  julia_file <- here::here("R", "ap_v3.jl")

  if (!file.exists(julia_file)) {
    stop(paste("Julia file not found:", julia_file))
  }

  julia_source(julia_file)

  # Report thread count
  nthreads <- julia_eval("Threads.nthreads()")
  message(sprintf("Julia initialized with %d threads and ap_v3.jl loaded successfully", nthreads))
}

# R function that calls the Julia implementation
ap_v3_julia <- function(S0, K, r, q, tt, sigma, steps) {
  # Basic input validation (Julia will do more thorough validation)
  if (
    missing(S0) ||
      missing(K) ||
      missing(r) ||
      missing(q) ||
      missing(tt) ||
      missing(sigma) ||
      missing(steps)
  ) {
    stop("All arguments must be provided")
  }

  # Call the Julia function
  result <- julia_call(
    "american_put_binomial",
    S0,
    K,
    r,
    q,
    tt,
    sigma,
    as.integer(steps)
  )

  return(result)
}

# Convenience function that combines initialization and execution
# Use this if you want a single function call
ap_v3_julia_auto <- function(
  S0,
  K,
  r,
  q,
  tt,
  sigma,
  steps,
  force_reinit = FALSE
) {
  # Check if Julia is already initialized
  if (!exists(".julia_initialized", envir = .GlobalEnv) || force_reinit) {
    init_julia_ap()
    assign(".julia_initialized", TRUE, envir = .GlobalEnv)
  }

  ap_v3_julia(S0, K, r, q, tt, sigma, steps)
}

# Vectorized batch function that processes multiple options in Julia using multi-threading
# This is much more efficient than calling ap_v3_julia in a loop or using R parallelization
ap_v3_julia_batch <- function(S0, K, r, q, tt, sigma, steps) {
  # Input validation - all vectors must have the same length
  n <- length(S0)

  if (length(K) != n || length(r) != n || length(q) != n ||
      length(tt) != n || length(sigma) != n) {
    stop("All parameter vectors must have the same length")
  }

  # Handle steps - can be a single value or a vector
  if (length(steps) == 1) {
    steps <- rep(steps, n)
  } else if (length(steps) != n) {
    stop("steps must be either a single value or a vector of the same length as other parameters")
  }

  # Call the Julia batch function
  result <- julia_call(
    "american_put_binomial_batch",
    S0,
    K,
    r,
    q,
    tt,
    sigma,
    as.integer(steps)
  )

  return(result)
}
# American Put Option Pricing using Binomial Tree
# Julia implementation optimized for performance

"""
    american_put_binomial(S0, K, r, q, tt, sigma, steps)

Calculate the price of an American put option using the Cox-Ross-Rubinstein binomial tree model.

# Arguments
- `S0::Real`: Initial stock price
- `K::Real`: Strike price
- `r::Real`: Risk-free interest rate (continuously compounded)
- `q::Real`: Dividend yield (continuously compounded)
- `tt::Real`: Time to maturity in years
- `sigma::Real`: Volatility of the underlying stock
- `steps::Integer`: Number of time steps in the binomial tree

# Returns
- `Float64`: Option price at the root node
"""
function american_put_binomial(S0::Real, K::Real, r::Real, q::Real,
                                tt::Real, sigma::Real, steps::Integer)

    # Input validation
    S0 > 0 || throw(ArgumentError("Initial stock price (S0) must be positive"))
    K > 0 || throw(ArgumentError("Strike price (K) must be positive"))
    tt > 0 || throw(ArgumentError("Time to maturity (tt) must be positive"))
    sigma > 0 || throw(ArgumentError("Volatility (sigma) must be positive"))
    q >= 0 || throw(ArgumentError("Dividend yield (q) must be non-negative"))
    steps >= 1 || throw(ArgumentError("Number of steps must be at least 1"))

    # Calculate parameters
    dt = tt / steps  # time between steps
    u = exp(sigma * sqrt(dt))  # Up factor
    d = exp(-sigma * sqrt(dt))  # Down factor
    p = (exp((r - q) * dt) - d) / (u - d)  # Risk-neutral probability

    # Validate risk-neutral probability (arbitrage-free condition)
    if p < 0 || p > 1
        throw(ArgumentError(
            "Risk-neutral probability p=$p is outside [0,1]. Model parameters lead to arbitrage."
        ))
    end

    disc = exp(-r * dt)  # Discount factor

    # Pre-allocate vector to store option values
    option_values = Vector{Float64}(undef, steps + 1)

    # Backward induction from maturity to root
    for j in (steps + 1):-1:1
        num_nodes = (j == steps + 1) ? steps + 1 : j

        for i in 1:num_nodes
            # Calculate asset price at this node
            asset_price = S0 * u^(j - i) * d^(i - 1)

            if j == steps + 1
                # At maturity: option value is intrinsic value
                option_values[i] = max(K - asset_price, 0.0)
            else
                # Before maturity: calculate continuation value and check early exercise
                continuation_value = (p * option_values[i] +
                                     (1 - p) * option_values[i + 1]) * disc
                intrinsic_value = K - asset_price
                option_values[i] = max(continuation_value, intrinsic_value)
            end
        end
    end

    return option_values[1]  # Option price at the root node
end

"""
    american_put_binomial_batch(S0_vec, K_vec, r_vec, q_vec, tt_vec, sigma_vec, steps_vec)

Calculate prices for multiple American put options using multi-threading.
This is much more efficient than calling american_put_binomial in a loop.

# Arguments
All arguments should be vectors of the same length:
- `S0_vec`: Vector of initial stock prices
- `K_vec`: Vector of strike prices
- `r_vec`: Vector of risk-free interest rates
- `q_vec`: Vector of dividend yields
- `tt_vec`: Vector of times to maturity
- `sigma_vec`: Vector of volatilities
- `steps_vec`: Vector of numbers of time steps

# Returns
- `Vector{Float64}`: Vector of option prices
"""
function american_put_binomial_batch(S0_vec::AbstractVector,
                                     K_vec::AbstractVector,
                                     r_vec::AbstractVector,
                                     q_vec::AbstractVector,
                                     tt_vec::AbstractVector,
                                     sigma_vec::AbstractVector,
                                     steps_vec::AbstractVector{<:Integer})
    # Validate that all vectors have the same length
    n = length(S0_vec)
    if length(K_vec) != n || length(r_vec) != n || length(q_vec) != n ||
       length(tt_vec) != n || length(sigma_vec) != n || length(steps_vec) != n
        throw(ArgumentError("All input vectors must have the same length"))
    end

    # Pre-allocate result vector
    results = Vector{Float64}(undef, n)

    # Use multi-threading to process options in parallel
    Threads.@threads for i in 1:n
        results[i] = american_put_binomial(
            S0_vec[i],
            K_vec[i],
            r_vec[i],
            q_vec[i],
            tt_vec[i],
            sigma_vec[i],
            steps_vec[i]
        )
    end

    return results
end
bench_result_one_option_julia <- bench::mark(
  ap_v3(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v3_rcpp(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  ap_v3_julia(S0 = s, K = k, r = r, q = d, tt = tt, sigma = v, steps = nstep),
  min_iterations = 10L
)
bench_result_one_option_julia
# A tibble: 3 × 6
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 ap_v3(S0 = s, K = k, r = r, q …  155.7ms 157.82ms      6.35    7.87KB     24.1
2 ap_v3_rcpp(S0 = s, K = k, r = …   7.95ms   7.98ms    125.      7.87KB      0  
3 ap_v3_julia(S0 = s, K = k, r =…  10.32ms  10.36ms     95.7     4.26KB      0  

Improving performance with Julia

Benchmark with all options

julia_v3_time_all_options <- bench::system_time({
  option_values_julia <- option_parameters |>
    mutate(
      option_value = pmap_dbl(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3_julia,
        .progress = TRUE
      )
    )
})
julia_v3_time_all_options
process    real 
  26.6s   26.2s 

The results are the same at 13 significant digits.

Improving performance with Julia

Parallelization

The Julia runtime needs to be started in each worker so the R wrapper function needs to include that step.

Or we can do the parallelization directly in Julia.

Improving performance with Rcpp

Parallelization results - all options

plan(multisession, workers = 8)
julia_v3_par_time_all_options <- bench::system_time({
  option_values_julia_par <- option_parameters |>
    mutate(
      option_value = future_pmap(
        list(
          S0 = asset_price,
          K = exercise_price,
          r = riskless_rate,
          q = dividend_rate,
          tt = time_to_maturity,
          sigma = volatility,
          steps = nstep
        ),
        ap_v3_julia_auto_safe,
        .progress = TRUE,
        .options = furrr_options(seed = TRUE)
      )
    ) |>
    unnest_wider(option_value, names_sep = "_")
})
# R wrapper for Julia implementation of American put option pricing
# Uses JuliaCall package to interface with Julia

# Load required package
if (!require("JuliaCall", quietly = TRUE)) {
  stop(
    "JuliaCall package is required. Install it with: install.packages('JuliaCall')"
  )
}

# Initialize Julia (only needs to be done once per session)
# This function will set up the Julia connection
init_julia_ap <- function(num_threads = "auto") {
  # Set the number of Julia threads before initialization
  # Options: "auto" for all cores, or a specific number like "8"
  Sys.setenv(JULIA_NUM_THREADS = as.character(num_threads))

  # Initialize Julia
  julia_setup()

  # Use here package to reliably find the project root
  if (!require("here", quietly = TRUE)) {
    stop("here package is required. Install it with: install.packages('here')")
  }

  # Source the Julia file from the R directory
  julia_file <- here::here("R", "ap_v3.jl")

  if (!file.exists(julia_file)) {
    stop(paste("Julia file not found:", julia_file))
  }

  julia_source(julia_file)

  # Report thread count
  nthreads <- julia_eval("Threads.nthreads()")
  message(sprintf("Julia initialized with %d threads and ap_v3.jl loaded successfully", nthreads))
}

# R function that calls the Julia implementation
ap_v3_julia <- function(S0, K, r, q, tt, sigma, steps) {
  # Basic input validation (Julia will do more thorough validation)
  if (
    missing(S0) ||
      missing(K) ||
      missing(r) ||
      missing(q) ||
      missing(tt) ||
      missing(sigma) ||
      missing(steps)
  ) {
    stop("All arguments must be provided")
  }

  # Call the Julia function
  result <- julia_call(
    "american_put_binomial",
    S0,
    K,
    r,
    q,
    tt,
    sigma,
    as.integer(steps)
  )

  return(result)
}

# Convenience function that combines initialization and execution
# Use this if you want a single function call
ap_v3_julia_auto <- function(
  S0,
  K,
  r,
  q,
  tt,
  sigma,
  steps,
  force_reinit = FALSE
) {
  # Check if Julia is already initialized
  if (!exists(".julia_initialized", envir = .GlobalEnv) || force_reinit) {
    init_julia_ap()
    assign(".julia_initialized", TRUE, envir = .GlobalEnv)
  }

  ap_v3_julia(S0, K, r, q, tt, sigma, steps)
}

# Vectorized batch function that processes multiple options in Julia using multi-threading
# This is much more efficient than calling ap_v3_julia in a loop or using R parallelization
ap_v3_julia_batch <- function(S0, K, r, q, tt, sigma, steps) {
  # Input validation - all vectors must have the same length
  n <- length(S0)

  if (length(K) != n || length(r) != n || length(q) != n ||
      length(tt) != n || length(sigma) != n) {
    stop("All parameter vectors must have the same length")
  }

  # Handle steps - can be a single value or a vector
  if (length(steps) == 1) {
    steps <- rep(steps, n)
  } else if (length(steps) != n) {
    stop("steps must be either a single value or a vector of the same length as other parameters")
  }

  # Call the Julia batch function
  result <- julia_call(
    "american_put_binomial_batch",
    S0,
    K,
    r,
    q,
    tt,
    sigma,
    as.integer(steps)
  )

  return(result)
}
julia_v3_par_time_all_options
process    real 
670.8ms   11.7s 

Note that the best approach would be to parallelize the code directly in Julia, but this is just an example of how to use Julia from R in parallel.

julia_v3_batch_time_all_options
process    real 
 27.81s   3.54s 

Much better!

Using torch for GPU computing

Torch is a scientific computing framework with wide support for machine learning algorithms and GPU computing.

PyTorch is the Python implementation and torch for R that mirrors PyTorch in R.

The advantage of using torch is that it provides a simple way to move data and computations to the GPU, it has the advantage of running on different hardware.

Using torch for GPU computing

There are many problems that can benefit from GPU computing, but not all.

And the transition is not always straightforward. The american option pricing problem is a good example of this. See: Chapter 45. Options Pricing on the GPU | NVIDIA Developer

Also consider numerical accuracy when using GPUs, as they may use lower precision arithmetic, the memory that is available, and the overhead of moving data between the CPU and GPU. See: Numerical accuracy — PyTorch 2.9 documentation

Using torch for GPU computing

Parallelization results - all options

# Load required package
if (!require("torch", quietly = TRUE)) {
  stop(
    "torch package is required. Install it with: install.packages('torch')"
  )
}

#' Price multiple American put options using a batched binomial tree on a GPU.
#'
#' This function extends the scalar CRR binomial tree implementation so that many
#' put option contracts can be priced simultaneously. The binomial time recursion
#' remains sequential, but each level operates on the entire batch, allowing the
#' GPU to process wide workloads efficiently. All contracts share the same
#' number of time steps \code{steps}; other parameters may vary by contract.
#'
#' @param S0 Numeric vector of initial stock prices.
#' @param K Numeric vector of strike prices.
#' @param r Numeric vector of continuously compounded risk-free rates.
#' @param q Numeric vector of dividend yields compounded continuously.
#' @param sigma Numeric vector of volatilities.
#' @param tt Numeric vector of times to maturity (years).
#' @param steps Integer scalar with the number of binomial steps (shared by batch).
#'
#' @return Numeric vector of American put option prices, one per contract.
ap_v3_torch <- function(S0, K, r, q, tt, sigma, steps) {
  # --- 1. Device and dtype selection ----------------------------------------------------------
  if (torch::cuda_is_available()) {
    device <- "cuda"
    float_dtype <- torch_double()
    message("CUDA device detected. Using torch_double() (float64).")
  } else if (torch::backends_mps_is_available()) {
    device <- "mps"
    float_dtype <- torch_float()
    message("MPS device detected. Using torch_float() (float32).")
  } else {
    stop("No compatible GPU (CUDA or MPS) available.")
  }

  # --- 2. Input validation and broadcasting ---------------------------------------------------
  if (length(steps) != 1) {
    stop("`steps` must be a single integer shared by the batch.")
  }
  steps <- as.integer(steps)
  if (steps <= 0) {
    stop("`steps` must be a positive integer.")
  }

  lengths <- c(
    length(S0),
    length(K),
    length(r),
    length(q),
    length(sigma),
    length(tt)
  )
  if (any(lengths == 0)) {
    stop("All parameters must have at least one value.")
  }
  batch <- max(lengths)

  broadcast_param <- function(x, name) {
    if (length(x) == 1) {
      rep(x, batch)
    } else if (length(x) == batch) {
      x
    } else {
      stop(sprintf(
        "`%s` must have length 1 or match the batch size (%d).",
        name,
        batch
      ))
    }
  }

  S0_vec <- as.numeric(broadcast_param(S0, "S0"))
  K_vec <- as.numeric(broadcast_param(K, "K"))
  r_vec <- as.numeric(broadcast_param(r, "r"))
  q_vec <- as.numeric(broadcast_param(q, "q"))
  sigma_vec <- as.numeric(broadcast_param(sigma, "sigma"))
  tt_vec <- as.numeric(broadcast_param(tt, "tt"))

  dt_vec <- tt_vec / steps
  if (any(dt_vec <= 0)) {
    stop("All maturities must be positive.")
  }

  u_vec <- exp(sigma_vec * sqrt(dt_vec))
  d_vec <- 1 / u_vec
  denom <- u_vec - d_vec
  if (any(abs(denom) < .Machine$double.eps)) {
    stop("Encountered zero denominator in probability computation.")
  }
  p_vec <- (exp((r_vec - q_vec) * dt_vec) - d_vec) / denom
  if (any(p_vec < 0 | p_vec > 1)) {
    stop("Risk-neutral probability out of bounds; check inputs.")
  }
  down_prob_vec <- 1 - p_vec
  df_vec <- exp(-r_vec * dt_vec)

  # --- 3. Promote scalars to batched GPU tensors ----------------------------------------------
  S0_t <- torch_tensor(S0_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  K_t <- torch_tensor(K_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  u_t <- torch_tensor(u_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  d_t <- torch_tensor(d_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  p_line <- torch_tensor(p_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  down_prob_line <- torch_tensor(
    down_prob_vec,
    device = device,
    dtype = float_dtype
  )$view(c(batch, 1, 1))
  df_line <- torch_tensor(df_vec, device = device, dtype = float_dtype)$view(c(
    batch,
    1,
    1
  ))
  zero_scalar <- torch_tensor(0, device = device, dtype = float_dtype)

  # --- 4. Precompute stock prices and exercise values (batched) -------------------------------
  j_seq <- torch_tensor(0:steps, device = device, dtype = float_dtype)
  j_indices <- j_seq$view(c(1, steps + 1, 1))
  n_seq <- torch_tensor(0:steps, device = device, dtype = float_dtype)
  n_indices <- n_seq$view(c(1, 1, steps + 1))

  S_all <- S0_t * (u_t$pow(j_indices)) * (d_t$pow(n_indices - j_indices))

  # Compute put payoffs: max(K - S, 0)
  ex_all <- torch_maximum(K_t - S_all, zero_scalar)

  # --- 5. Backward induction across the batch -------------------------------------------------
  V <- ex_all$narrow(3, steps + 1, 1)

  for (n in (steps - 1):0) {
    n_nodes <- n + 1

    V_up <- V$narrow(2, 2, n_nodes)
    V_down <- V$narrow(2, 1, n_nodes)
    V_hold <- df_line * (p_line * V_up + down_prob_line * V_down)

    V_ex <- ex_all$narrow(2, 1, n_nodes)$narrow(3, n + 1, 1)
    V <- torch_maximum(V_hold, V_ex)
  }

  # --- 6. Return one price per contract -------------------------------------------------------
  as.numeric(V$squeeze())
}
# Batch size = 500 (5 batches)
torch_v3_time_batch_500 <- bench::system_time({
  option_values_torch_500 <- process_torch_batches(
    option_parameters,
    batch_size = 500,
    nstep = nstep
  )
})
torch_v3_time_batch_500
process    real 
  24.4s   24.1s 

They are only equal up to 4 significant digits due to differences in floating point arithmetic on the GPU. On a mac the GPU uses 32-bit floats by default which limits precision.

Summary

Results for the calculation of 2500 american put options in the test sample:

approach speedup total time
Naive R 1.000000 10.09m
V3 R 1.523794 6.62m
V3 R parallel 12.357288 48.97s
V3 Rcpp parallel 86.267275 7.01s
V3 Rcpp parallel (package) 208.490073 2.9s
V3 Julia (parallelization in R) 51.543424 11.74s
V3 Julia (parallelization in Julia) 170.829701 3.54s

HDFE regression

HDFE regression

Worksop materials for this part are available at:

https://github.com/reisportela/cp2hdfe